Unsupervised Learning: Trade&Ahead

Marks: 60

Context

The stock market has consistently proven to be a good place to invest in and save for the future. There are many compelling reasons to invest in stocks. It can help fight inflation, create wealth, and provide some tax benefits. Good steady returns on investments over a long period of time can also grow a lot more than seems possible. Also, thanks to the power of compound interest, the earlier one starts investing, the larger the corpus one can have for retirement. Overall, investing in stocks can help meet life's financial aspirations.

It is important to maintain a diversified portfolio when investing in stocks in order to maximise earnings under any market condition. A diversified portfolio tends to yield higher returns and carry lower risk by tempering potential losses when the market is down. It is easy to get lost in a sea of financial metrics while determining the worth of a stock, and doing the same for a multitude of stocks to identify the right picks for an individual can be a tedious task. By doing a cluster analysis, one can identify stocks that exhibit similar characteristics and ones that exhibit minimum correlation. This will help investors better analyze stocks across different market segments and help protect against risks that could make the portfolio vulnerable to losses.

Objective

Trade&Ahead is a financial consultancy firm that provides its customers with personalized investment strategies. They have hired you as a Data Scientist and provided you with data comprising stock prices and some financial indicators for a few companies listed on the New York Stock Exchange. They have assigned you the tasks of analyzing the data, grouping the stocks based on the attributes provided, and sharing insights about the characteristics of each group.

Data Dictionary

Notes to reviewer

From time to time, you will find the %%time command included in some cells. It is there because I was interested in knowing the execution time of some cells, especially the cells that run complex commands. The command is not critical to the final outcome, but the timing is interesting to know.

For the bar plots, I printed the totals and percentages alongside each plot because I like to see the exact numbers.

I am not the world's greatest speller so there will be spelling mistakes. Jupyter really needs a spell checker.

Importing necessary libraries and data

Function Definitions

I am old fashioned and was originally trained in procedural code. I like defining functions at the top of the notebook. I also find it is better to have everything defined upfront so you can define and load everything all at once. The flow of the code is also not interrupted by a function definition.

Load Data

Data Overview

Notes:

A complete data set

Notes

========================================================================================================================

Exploratory Data Analysis (EDA)

Questions:

  1. What does the distribution of stock prices look like?
  2. The stocks of which economic sector have seen the maximum price increase on average?
  3. How are the different variables correlated with each other?
  4. Cash ratio provides a measure of a company's ability to cover its short-term obligations using only cash and cash equivalents. How does the average cash ratio vary across economic sectors?
  5. P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings. How does the P/E ratio vary, on average, across economic sectors?
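The two ratios named in questions 4 and 5 are simple quotients. As a minimal sketch with made-up figures (the function names and numbers are illustrative assumptions, not values from the assignment data):

```python
def cash_ratio(cash_and_equivalents: float, current_liabilities: float) -> float:
    """Cash and cash equivalents divided by current (short-term) liabilities."""
    return cash_and_equivalents / current_liabilities


def pe_ratio(share_price: float, earnings_per_share: float) -> float:
    """Share price divided by earnings per share."""
    return share_price / earnings_per_share


# Hypothetical company: $50M in cash, $100M in short-term liabilities,
# a $40 share price and $2 of earnings per share.
print(cash_ratio(50e6, 100e6))  # 0.5
print(pe_ratio(40.0, 2.0))      # 20.0
```

A cash ratio of 0.5 means cash on hand covers half of the short-term obligations; a P/E of 20 means investors pay $20 per dollar of earnings.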

Univariate Analysis

Plotting histograms and boxplots for all the variables at one go

Notes:

It is about even when it comes down to which metrics are right skewed and which are relatively normally distributed.
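A numeric skewness check can complement the visual read of the histograms. The sketch below uses toy columns and an arbitrary cut-off of 0.5; both are assumptions for illustration, not the notebook's data or method:

```python
import pandas as pd

# Toy values, not the assignment data.
df = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
    "right_skewed": [1, 1, 2, 2, 3, 4, 30],
})


def describe_shape(value: float, cutoff: float = 0.5) -> str:
    """Label a skewness value; the cut-off is a rule of thumb, not a standard."""
    if value > cutoff:
        return "right skewed"
    if value < -cutoff:
        return "left skewed"
    return "roughly symmetric"


skew = df.skew()  # sample skewness of each numeric column
shape = skew.apply(describe_shape)
print(pd.DataFrame({"skew": skew.round(2), "shape": shape}))
```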

GICS Sectors with highest growth rates

Notes

The sectors with the biggest growth are:

The sector with the least growth is:

Notes

The subsectors with the largest growth are: </br>

* Note: high growth is defined as Growth >= 0.020

The subsectors with middle level growth are: </br>

* Note: middle level growth is defined as Growth >= 0.01 and < 0.020

All other sectors are low growth. </br>

* Low growth is defined as Growth < 0.01

Bivariate Analysis

Notes:

These variables have a high correlation:

These variables have a low correlation:

All the other variables have a low positive or low negative correlation. It is not surprising that Volatility shows up so many times in the low correlation list. Markets, in general, do not like volatility, as the events of 2020 through 2022 showed.
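The high and low correlation lists can be read off a heatmap, but they can also be produced as a sorted list of variable pairs. A minimal sketch on synthetic columns (the names `a`, `b`, `c` and the data are placeholders, not the assignment's variables):

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),  # built to track "a" closely
    "c": rng.normal(size=200),                     # independent noise
})

corr = df.corr()  # Pearson correlation matrix

# List each unordered pair once, sorted from most positive to most negative.
pairs = pd.Series(
    {(x, y): corr.loc[x, y] for x, y in combinations(corr.columns, 2)}
).sort_values(ascending=False)
print(pairs.round(3))
```

Sorting by `pairs.abs()` instead would rank pairs by strength regardless of sign.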

Notes:

This chart reinforces what has been seen in the heatmap and other graphs. The variables are not completely independent of each other, and each KDE is either right skewed or approximately normal.

Check the stocks of which economic sector have seen the maximum price increase on average

Notes:

Energy seems counterintuitive given the focus on renewable and green energy.
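The per-sector averages used in this and the following sector comparisons come down to a group-by and a mean. A minimal sketch, assuming hypothetical column names `sector` and `price_change` and invented values:

```python
import pandas as pd

# Column names and values are placeholders, not the assignment data.
df = pd.DataFrame({
    "sector": ["Energy", "Energy", "Health Care", "Health Care", "Utilities"],
    "price_change": [-12.0, -8.0, 9.0, 11.0, 1.0],
})

avg_change = (
    df.groupby("sector")["price_change"]
    .mean()
    .sort_values(ascending=False)  # largest average increase first
)
print(avg_change)
```

Swapping `price_change` for a cash ratio, net cash flow, P/E ratio, or volatility column gives the other sector-level averages in the same way.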

Average cash ratio varies across economic sectors

Notes:

The sectors with the highest average cash ratios are:

These sectors will have the greatest chance of meeting their financial obligations, like payroll, from existing cash and cash equivalents without having to issue a debt instrument like Commercial Paper.

The sectors with the lowest average cash ratios are:

These sectors will have the lowest chance of meeting their financial obligations, like payroll, from existing cash and cash equivalents without having to issue a debt instrument like Commercial Paper.

Net cash flow varies across economic sectors

Notes:

As for actual cash flow across each sector, these sectors have the lowest net cash flow to expenses:

The real estate sector is just breaking even. </br> All other sectors have a positive cash flow.

Average P/E ratio variance across economic sectors

Notes:

The price to earnings ratios are the highest in these sectors:

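This means investors are paying the most per dollar of earnings in these sectors, which usually reflects an expectation of strong future growth. A high P/E ratio does not by itself mean a greater chance of a positive return on investment.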

The price to earnings ratios are the lowest in these sectors:

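This means investors are paying the least per dollar of earnings in these sectors, which can reflect either undervalued shares or weak growth expectations. A low P/E ratio does not by itself mean a lower chance of a positive return on investment.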

Volatility variance across economic sectors

Notes:

======================================================================================================================

Data Preprocessing

Notes:

Notes:

The outliers will not be treated, as treating them would skew the data to the point where the data is no longer valuable.
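Before deciding to leave outliers untreated, it can help to count how many values fall outside the boxplot whiskers. A minimal sketch, assuming the usual 1.5 × IQR rule and toy numbers (the notebook's own outlier check is not shown here, and `iqr_outlier_mask` is a hypothetical helper):

```python
import numpy as np


def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], the boxplot whisker rule."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


sample = [10, 12, 11, 13, 12, 11, 95]  # toy data with one extreme value
mask = iqr_outlier_mask(sample)
print(int(mask.sum()), "outlier(s) flagged")
```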

=====================================================================================================================

EDA

Plotting histograms and boxplots for all the variables at one go

Notes

=====================================================================================================================

Scaling and Create DataFrame
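Distance-based clustering is sensitive to the scale of each variable, so the numeric columns are standardised first. A minimal sketch of scaling and rebuilding a DataFrame, assuming z-score scaling with scikit-learn's `StandardScaler` and placeholder column names (the notebook's actual scaler is not shown in this text):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy numeric frame; column names are placeholders.
df = pd.DataFrame({
    "price": [10.0, 20.0, 30.0, 40.0],
    "volatility": [1.0, 1.5, 2.0, 2.5],
})

scaler = StandardScaler()  # subtract the column mean, divide by the column std
scaled_df = pd.DataFrame(
    scaler.fit_transform(df), columns=df.columns, index=df.index
)
print(scaled_df.round(3))
```

Keeping the original index and column names makes it easy to join cluster labels back onto the unscaled data later.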

Notes:

Notes:

========================================================================================================================

K-means Clustering

Checking Elbow Plot
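The elbow plot tracks how the within-cluster sum of squares (inertia) falls as the number of clusters grows; the "elbow" is where the drop flattens out. A minimal sketch on synthetic blobs that stand in for the scaled stock data (the data, the range of k, and the random seed are all assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated synthetic blobs stand in for the scaled stock data.
rng = np.random.default_rng(1)
centers = np.array([[0, 0], [8, 8], [0, 8]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances

for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting `inertias` against k gives the elbow curve; on this toy data the sharp bend is at k = 3, the number of blobs that were generated.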

Notes:

For the silhouette scores, I am going to compute the scores for the same numbers of clusters, plus the score for 10 clusters.

* I know 10 clusters is incorrect, but I wanted to see what a bad example looks like.

Checking the silhouette scores
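The silhouette score summarises how well each point sits inside its own cluster compared with the nearest other cluster, with values closer to 1 being better. A minimal sketch on synthetic blobs (not the stock data), including k = 10 as a deliberately poor choice:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs stand in for the scaled stock data.
rng = np.random.default_rng(1)
centers = np.array([[0, 0], [8, 8], [0, 8]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

scores = {}
for k in [2, 3, 4, 5, 10]:  # 10 is included as a deliberately poor choice
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, {k: round(s, 3) for k, s in scores.items()})
```

On real data the peak is rarely this clear, so the score is best read together with the elbow plot rather than on its own.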

Notes:

Creating Final Model

Notes:

Notes

Given these reasons, I would proceed with n_clusters = 4.

====================================================================================================================

Hierarchical Clustering

Computing Cophenetic Correlation

Different linkage methods with Euclidean distance only
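The cophenetic correlation measures how faithfully a dendrogram preserves the original pairwise distances; values closer to 1 are better. A minimal sketch that compares linkage methods under Euclidean distance with SciPy, using random data as a stand-in (the list of methods is an assumption, since the notebook's own list is not shown here):

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))  # synthetic stand-in for the scaled data

distances = pdist(X, metric="euclidean")  # condensed pairwise distances
results = {}
for method in ["single", "complete", "average", "weighted", "centroid", "ward"]:
    Z = linkage(X, method=method, metric="euclidean")
    coph_corr, _ = cophenet(Z, distances)
    results[method] = coph_corr

best_method = max(results, key=results.get)
print(best_method, round(results[best_method], 3))
```

Centroid and Ward linkage are only defined for Euclidean distance in SciPy, which is one reason to keep the metric fixed in this comparison.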

Notes

Checking Dendrograms

Notes

Notes

The dendrograms show that Euclidean distance with average linkage is the best combination, with a cophenetic correlation of 0.94.

Creating Model using sklearn
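A minimal sketch of fitting the hierarchical model with scikit-learn's `AgglomerativeClustering`, assuming average linkage with Euclidean distance and four clusters on synthetic blobs (the cluster count and data here are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Four well-separated synthetic blobs stand in for the scaled stock data.
rng = np.random.default_rng(3)
centers = np.array([[0, 0], [8, 8], [0, 8], [8, 0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(25, 2)) for c in centers])

# Euclidean distance is the default, so only the linkage is spelled out.
# (Newer scikit-learn versions name the distance argument `metric`;
# older ones called it `affinity`.)
model = AgglomerativeClustering(n_clusters=4, linkage="average")
labels = model.fit_predict(X)
print(np.bincount(labels))  # number of points in each cluster
```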

Notes:

Cluster Profiling

Notes:

The numbers should be showing up with yellow highlights, but for some reason they are not
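A cluster profile is typically the per-cluster mean of each variable plus a count of members. A minimal sketch with placeholder column names and toy values:

```python
import pandas as pd

# Toy data; "cluster" holds the labels assigned by the clustering model.
df = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1],
    "price": [10.0, 14.0, 100.0, 110.0, 120.0],
    "volatility": [2.0, 3.0, 1.0, 1.5, 0.5],
})

profile = df.groupby("cluster").mean()            # mean of every variable
profile["count"] = df.groupby("cluster").size()   # members per cluster
print(profile)
```

On the missing highlights, one possible cause (an assumption, since the cell is not shown here): `profile.style.highlight_max(color="yellow", axis=0)` returns a Styler object, and its colours only render when that object is displayed as the cell's output or via `display(...)`, not when it is passed to `print`. Some notebook viewers and export formats also drop the styling.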

Notes

Notes

Notes:

Notes

K-means vs Hierarchical Clustering

You can compare several things, like:

You can also mention any differences or similarities you obtained in the cluster profiles from both the clustering techniques.
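One way to quantify how similar the two sets of clusters are is a cross-tabulation of the labels together with the adjusted Rand index, which is 1 for identical groupings and close to 0 for unrelated ones. A minimal sketch with invented labels for eight stocks:

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical labels for eight stocks from the two techniques.
kmeans_labels = [0, 0, 1, 1, 2, 2, 3, 3]
hc_labels = [1, 1, 0, 0, 2, 2, 2, 3]

# Rows are K-means clusters, columns are hierarchical clusters.
table = pd.crosstab(
    pd.Series(kmeans_labels, name="kmeans"),
    pd.Series(hc_labels, name="hierarchical"),
)
ari = adjusted_rand_score(kmeans_labels, hc_labels)
print(table)
print(round(ari, 3))
```

Cluster numbers are arbitrary in both techniques, so the table matters more than whether the label values match: a row with a single non-zero cell means both methods grouped those stocks together.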

Notes:

Business Insights

Conclusions

Future Recommendations